Add pointwise indexing via isel_points method #481

Merged: 7 commits from the feature/isel_points branch merged into pydata:master on Jul 27, 2015
Conversation

@jhamman (Member) commented Jul 20, 2015

This provides behavior equivalent to numpy slicing with multiple lists.

Example

>>> da = xray.DataArray(np.arange(56).reshape((7, 8)), dims=['x', 'y'])
>>> da
<xray.DataArray (x: 7, y: 8)>
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31],
       [32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47],
       [48, 49, 50, 51, 52, 53, 54, 55]])
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6
  * y        (y) int64 0 1 2 3 4 5 6 7
>>> da.isel_points(x=[0, 1, 6], y=[0, 1, 0])
<xray.DataArray (points: 3)>
array([ 0,  9, 48])
Coordinates:
    y        (points) int64 0 1 0
    x        (points) int64 0 1 6
  * points   (points) int64 0 1 2
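
For comparison, the same values with plain numpy fancy indexing (a quick check against the array above):

>>> da.values[[0, 1, 6], [0, 1, 0]]
array([ 0,  9, 48])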

related: #475

# all the indexers should be iterables
keys = indexers.keys()
indexers = [(k, ([v] if not isinstance(v, Sequence) else v))

shoyer (Member):

Probably better to raise an error if someone tries to pass in something that isn't a sequence? I would probably coerce with np.asarray and then raise if v.dtype.kind != 'i' or v.ndim != 1.
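
A minimal sketch of the suggested check (the helper name here is hypothetical, not code from this PR):

import numpy as np

def _assert_pointwise_indexer(v):
    # coerce to an array and reject anything that is not a 1d integer array;
    # note that a slice coerces to a 0d object array, so it is rejected too
    v = np.asarray(v)
    if v.dtype.kind != 'i' or v.ndim != 1:
        raise ValueError('pointwise indexers must be 1d integer arrays')
    return v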

jhamman (Member Author):

My thought was that you may not know the shape of x and y a priori, but I'm happy to remove this. After thinking about how numpy treats scalar vs. array indices (example below), I think it is best to remove this and raise an error.

In [1]: x = np.arange(12).reshape((3, 4))

In [2]: x[2, 3]
Out[2]: 11

In [3]: x[[2], [3]]
Out[3]: array([11]) 

jhamman (Member Author):

I've been trying to implement this and I'm not getting it to work. Specifically, how do we want to handle slice objects? As I think about it more, I'm not sure how my first commit was working with slice objects as indexers.

Any thoughts on the best way to handle this?

shoyer (Member):

I would raise an error with slice objects, too. 1d arrays of integers is enough functionality.


jhamman (Member Author):

Okay. That is done in my last commit so no further change needed here.

@shoyer added this to the 0.6 milestone on Jul 20, 2015

actual = da.isel_points(y=y, x=x, dim='test_coord')
assert 'test_coord' in actual.coords
assert actual.coords['test_coord'].shape == (len(y), )

shoyer (Member):

I would also verify that x and y are still coordinates, along the test_coord dimension.

Probably easier just to construct the expected data-array and then compare them with self.assertDataArrayIdentical.
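
For instance, a rough sketch of that pattern, reusing the 7x8 example array from the PR description (the exact coordinates on the expected result depend on the final implementation):

x = [0, 1, 6]
y = [0, 1, 0]
# build the expected result by hand: x and y become coordinates along the
# new 'test_coord' dimension
expected = xray.DataArray([0, 9, 48], dims='test_coord',
                          coords={'x': ('test_coord', x),
                                  'y': ('test_coord', y)})
actual = da.isel_points(x=x, y=y, dim='test_coord')
self.assertDataArrayIdentical(expected, actual)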

jhamman (Member Author):

done.

@shoyer (Member) commented Jul 20, 2015

Very nice! Really looking forward to this one 👍.

return concat([self.isel(**d) for d in
               [dict(zip(keys, inds)) for inds in
                zip(*[v for k, v in indexers])]],
              dim=dim)
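
For reference, the nested comprehension above is roughly equivalent to this expanded loop (a sketch for readability, not code from the PR):

# keys: list of indexed dimension names, e.g. ['x', 'y']
# indexers: list of (dim, values) pairs, e.g. [('x', [0, 1, 6]), ('y', [0, 1, 0])]
selections = []
for inds in zip(*[v for k, v in indexers]):
    # one integer index per dimension, i.e. one point
    selections.append(self.isel(**dict(zip(keys, inds))))
return concat(selections, dim=dim)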

shoyer (Member):

Let's explicitly provide the coords and data_vars arguments to concat:

indexer_dims = set(indexers)

def relevant_keys(mapping):
    return [k for k, v in mapping.items()
            if any(d in indexer_dims for d in v.dims)]

data_vars = relevant_keys(self.data_vars)
coords = relevant_keys(self.coords)

This means concat doesn't need to look at any data to figure out which variables should be concatenated (vs. variables which were not indexed).

A good test case here would be a dataset that has a constant variable, e.g.,

ds = xray.Dataset({'x': range(5), 'y': 0})

If you do ds.isel_points(x=[0, 1, 2]), y should still be a scalar.
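
A sketch of that test, assuming the behaviour described above:

ds = xray.Dataset({'x': range(5), 'y': 0})
actual = ds.isel_points(x=[0, 1, 2])
# 'y' was not indexed along any of the indexer dimensions,
# so it should remain a scalar variable
assert actual['y'].ndim == 0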

jhamman (Member Author):

test added in test_dataset.py

----------
dim : str or DataArray or pandas.Index, optinal
Name of the dimension to concatenate along. This can either be a
new dimension name, in which case it is added along axis=0, or an

shoyer (Member):

Existing dimension names are not valid choices for sel_points, I think. Actually, that's probably an edge case worth writing a test for.

jhamman (Member Author):

done.

shoyer (Member):

You still need to update the docstring here.

raise ValueError('All indexers must be the same length')

# Existing dimensions are not valid choices for the dim argument
if dim in self.dims:

shoyer (Member):

This will fail if you pass in an array for the dimension argument -- which also still needs a test.

concat might actually already do reasonable error handling here. If not, you'll need to make use of _calc_concat_dim_coord: https://github.com/xray/xray/blob/cb51b359d213e711208f2747ffc5ab4acb25dc4d/xray/core/combine.py#L118-L137

jhamman (Member Author):

Indeed. Happy to write that test. However, I'm actually having trouble picturing the use case here. Do you have an example?

shoyer (Member):

I have an xray Dataset ds_stations with station data with dimensions (station, time). Now I pointwise index into a 2D grid with something like: ds_grid.sel_points(latitude=ds_stations.latitude, longitude=ds_stations.longitude, dim=ds_stations.station, method='nearest').

Now the station dimension in the result is labeled by station IDs, not sequential integers.

shoyer (Member):

On the other hand, I might not even need the dim argument in this case, given that the metadata is already contained in the ds_stations.latitude and ds_stations.longitude DataArray objects.

@jhamman (Member Author) commented Jul 26, 2015

@shoyer - This last commit should handle DataArray and list-like objects for the dim argument. Take a look and let me know what you think. It took me a while to wrap my head around the issue, but in the end, I think the solution was pretty straightforward.
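
For illustration, a small sketch of what a DataArray dim argument enables, reusing the 7x8 example from the description (the station labels are made up):

stations = xray.DataArray(['A', 'B', 'C'], dims='station', name='station')
result = da.isel_points(x=[0, 1, 6], y=[0, 1, 0], dim=stations)
# the new dimension should be labeled by the station names instead of 0, 1, 2
print(result['station'].values)  # expected: ['A' 'B' 'C']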

@@ -134,6 +135,21 @@ __ http://legacy.python.org/dev/peps/pep-0472/
# this is safe
arr[dict(space=0)] = 0

Pointwise indexing
--------------------------------

shoyer (Member):

you need to have the same number of dashes as the length of the line above.


Parameters
----------
dim : str or DataArray or pandas.Index or other list-like object, optinal

shoyer (Member):

spelling: optinal -> optional

@shoyer (Member) commented Jul 27, 2015

This needs a couple of small fixes to the docs but otherwise looks ready to merge!

Something to consider: maybe indexing like ds.isel_points(x=stations.x, y=stations.y) should automatically determine the dim argument from the stations.x DataArray?

We could certainly add that later, and it makes more sense once we have sel_points anyways.

@jhamman (Member Author) commented Jul 27, 2015

Okay, thanks. I made some minor revisions to the docs and docstrings.

I think we should wait on the automatic discovery of the dimension name. I think adding sel_points will add a few more use cases for isel_points and may make it clearer how to determine the dim name.

shoyer added a commit that referenced this pull request on Jul 27, 2015: Add pointwise indexing via isel_points method
@shoyer merged commit 69f7386 into pydata:master on Jul 27, 2015
@shoyer (Member) commented Jul 27, 2015

woohoo! thanks again for putting this together :)

@jhamman deleted the feature/isel_points branch on July 27, 2015 05:37